Introduction

The “Spotify: All Time Top 2000 Mega Dataset” is a dataset from Kaggle that contains various audio statistics and ratings of the top 1,994 songs on Spotify. For each song, it includes information such as the Title, Artist, Top Genre (a very specific Spotify-labeled genre), Year of release, BPM (beats per minute), and Duration. The first three are nominal categorical variables, Year is an ordinal categorical variable, and the last two are quantitative variables. In addition, for each song, this dataset includes various quantitative ratings, such as those measuring its level of Energy, Danceability, Loudness, Liveness, Valence, Acousticness, Speechiness, and Popularity.

We added Genre, Decade, and Decade Range columns so that we could cluster songs into fewer groups, which in turn makes the visualizations clearer. The Genre column was created by manually sorting the 149 unique Top Genres into six main categories based on trends we saw within the data (Rock, Pop, Country, Hip Hop, Indie, and Other). Because of this manual sorting, there is a possibility of some human bias in results related to genre. Meanwhile, the Decade and Decade Range (two ranges, 1950s-1980s and 1990s-2010s) columns were assembled programmatically.
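As a rough illustration of how the Decade and Decade Range columns could be derived from Year, here is a minimal Python sketch. The function names, the dictionary layout, and the sample rows are hypothetical, and we assume the boundary between the two ranges falls between 1989 and 1990; this is not necessarily the code used for the report.

```python
def decade_of(year: int) -> str:
    """Return a decade label such as '1980s' for a release year."""
    return f"{(year // 10) * 10}s"


def decade_range_of(year: int) -> str:
    """Assign a year to one of the two ranges named in the text.

    Assumption: the split falls between 1989 and 1990.
    """
    return "1950s-1980s" if year < 1990 else "1990s-2010s"


# Hypothetical rows standing in for the real dataset.
songs = [
    {"title": "Song A", "year": 1967},
    {"title": "Song B", "year": 1991},
    {"title": "Song C", "year": 2015},
]

# Add the two derived columns to every row.
for song in songs:
    song["decade"] = decade_of(song["year"])
    song["decade_range"] = decade_range_of(song["year"])
```

The Genre column cannot be derived this way, since it depends on the manual mapping of each Top Genre to a main category.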

Using this dataset, we will answer three main questions in our report:

Analysis

Question 1: Understanding Genre-Specific Attributes

In our analysis of genres, one of the first things we sought to analyze was how our quantitative variables varied individually by genre (our manually sorted categories of Rock, Pop, Country, Hip Hop, Indie, and Other). As a first pass, we created density plots, colored according to genre, for each of the ten quantitative variables. Note that we excluded the Other category from this analysis, as the data in this category is too diverse to support meaningful conclusions (it includes Top Genres originally labelled “blues”, “electro”, “reggae”, “streektaal”, etc.).

Exploratory density plots of quantitative variables


As we can see from the density plots above, BPM, Energy, Danceability, Valence, and Popularity show some clear differences between the genres, while the other five variables show mostly similar trends across the five genres. For BPM, we can see two main peaks, with pop, indie, and hip-hop on the lower end, and country and rock having higher BPMs. This is a fairly surprising trend, as many people perceive pop and hip-hop songs as having faster tempos, contrary to what this graph shows. For Energy, we can again see two main peaks, with hip-hop on the higher end and country and indie on the lower end. Again, this is somewhat surprising given the BPM results, since we assumed faster-paced songs would tend to have more energy. The Danceability plot shows that all genres but hip-hop have a similar distribution. For Valence, we can see that three of the genres (country, pop, and rock) have a similar distribution, while hip-hop and indie have different distributions (with slightly higher peaks). Lastly, Popularity has mostly similar distributions across genres, but with indie having a much lower peak than the rest. Overall, the five genres are mostly similar, but a few stand out for certain variables; in particular, hip-hop stands out for many of them.

Next, we looked at a dendrogram built from all ten of the quantitative variables, to get a picture of how similar or different the genres are when all of the variables are taken into account at the same time. Five colors were used for the branches because we have five genres, so that any tendency of the genres to cluster together would be easy to see. The Other category was again removed, for the same reason as in the density plots above.

In the dendrogram above, we have colored the leaves by genre in order to visualize how the data ended up clustering. We can see that each cluster tends to be somewhat uniformly divided between the genres. Most notably, the hip-hop genre that we highlighted earlier does not appear to be significantly different from the other genres. From this graph, we can conclude that overall, when taking into account all of the quantitative variables, there does not appear to be much difference between the five genres. This plot does, however, confirm our suspicion that most of this dataset is composed of rock songs.

Lastly for this question, we looked at contour plots of the variables of interest that we highlighted in the density plots above. Leaving out Popularity (as that is analyzed in the next question), the four variables that we looked at are BPM, Energy, Danceability, and Valence. We look at six contour plots below, one for each of the six pairings of these variables:

Contour plots of quantitative variables of interest


These graphs provide a somewhat surprising conclusion: despite our observations from the density plots, the five genres do not seem to differ much, if at all, when looking at two variables instead of one, even when limited to the variables that seemed to show differentiation. In the BPM vs. Energy, BPM vs. Danceability, Energy vs. Danceability, and Danceability vs. Valence graphs, there appears to be only one peak, centered around the middle, which means that there is little to no differentiation between the genres for these variables. The BPM vs. Valence graph, while showing two peaks, again does not show much differentiation, as those peaks are located fairly close to each other and on similar contour levels; furthermore, the genres seem very similarly distributed between them. The Energy vs. Valence graph is similar: three peaks, but close together and with the genres similarly distributed among them.

Overall, we have found that although looking at one variable at a time reveals some differences between the genres, the genres in this dataset appear fairly similar when all ten quantitative variables are considered together.

Question 2: Exploring Popularity

Interested in making some popular music? The best way to learn how may be to study the professionals. What qualities do popular songs embody? By taking a look at the correlations between Popularity and other attributes, we can see what these popular songs have in common. We should start by investigating just popularity itself across all songs with univariate exploratory data analysis in the form of a histogram. This allows us to get a sense of what sort of popularity ratings are common and uncommon amongst songs.

Most songs seem to be concentrated in the middle half of the possible popularity range (between 25 and 75). The distribution is unimodal and clearly left-skewed, with a peak around 65.

We want to look into associations, so let’s see how popularity is correlated with other quantitative variables through a pairs plot.

Pairs plot of all quantitative variables


This pairs plot is incredibly messy. What we are really looking for is the correlation between just Popularity and each of the other individual variables.

Variable        Correlation with Popularity
BPM             -0.00318
Energy           0.103
Danceability     0.144
Loudness         0.166
Liveness        -0.122
Valence          0.0959
Duration        -0.0367
Acousticness    -0.0876
Speechiness      0.112

After isolating the correlation coefficients from the pairs plot, we can see how correlated Popularity is with each of the other quantitative variables. As the magnitudes of all of these correlation coefficients are under 0.2, we can state that Popularity has, at most, a weak linear association with any of these quantitative variables.
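For reference, the correlation coefficients discussed here are Pearson correlations, which can be computed as in the following Python sketch. The helper name `pearson_r` and the toy columns are hypothetical and purely illustrative; they are not values from the dataset, and this is not necessarily how the report's coefficients were produced.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy columns (made-up numbers, NOT taken from the dataset).
columns = {
    "Popularity": [40, 55, 62, 70, 48, 66],
    "BPM": [120, 98, 140, 110, 132, 101],
    "Energy": [30, 72, 55, 80, 41, 64],
}

# Correlate Popularity with every other column.
correlations = {
    name: pearson_r(values, columns["Popularity"])
    for name, values in columns.items()
    if name != "Popularity"
}
```

Because this coefficient only measures linear association, a value near zero does not by itself rule out a nonlinear relationship.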

Looking across genres, one clear trend is that indie songs are less popular than those of the other genres. Indie’s second and third quartiles range from around 30 to 45, whereas every other genre’s 25th percentile is above 45. Hip-hop appears to have the highest median popularity across the genres, but with a clear left skew, so that its inner two quartiles overlap heavily with those of all non-indie genres. Overall, it is hard to judge any clear differences between the non-indie genres, but indie music is definitely less popular. This difference can be confirmed using a few statistical tests. We first used a one-way ANOVA test with the null hypothesis that the group means are equal. The p-value of the test is \(<2.2e-16\), meaning that there is sufficient evidence to reject the null hypothesis. We then performed t-tests comparing the popularity of indie songs with that of each of the remaining genres. In every case, the p-value is below the significance level \(\alpha = 0.05\), meaning that we are consistently able to reject the null hypothesis that the two group means are equal.
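To make the one-way ANOVA concrete, the following Python sketch computes its F statistic, the ratio of between-group to within-group mean squares. The function name and the toy groups are hypothetical and are not the report's data. Turning F into a p-value additionally requires the F distribution, which a statistics library can supply (for example, SciPy's `scipy.stats.f_oneway`).

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA over a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between-group variability: group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means)
    )
    # Within-group variability: observations around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, means) for x in g
    )

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Toy popularity scores per genre (made-up numbers for illustration).
toy_groups = [
    [1, 2, 3],  # hypothetical genre 1
    [2, 3, 4],  # hypothetical genre 2
    [3, 4, 5],  # hypothetical genre 3
]
f_stat = one_way_anova_f(toy_groups)
```

A large F means the group means differ by more than the spread within each group would explain, which is the pattern behind a very small p-value.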

If we want to look into which artists tend to have different popularity ratings, one option is to create two word clouds split on popularity. To make the two word clouds roughly equal in terms of amount of data, the less popular one contains the artists of songs with popularity less than or equal to 60, while the popular one contains the artists of songs with popularity greater than 60.

The larger a name appears in a word cloud, the more often that artist appears in that subset of the data. Perhaps surprisingly, in the left word cloud, we see that Queen, ABBA, and Elvis Presley are associated with releasing less popular music. Meanwhile, Coldplay, The Beatles, Adele, and Michael Jackson are associated with releasing more popular music. Note that we can only interpret these results in terms of association and not causation; emulating the style of one of the artists in the right-hand word cloud is not guaranteed to make you a superstar, unfortunately.
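The split behind the two word clouds amounts to counting how often each artist appears on either side of the popularity threshold of 60. A minimal Python sketch follows; the function name, the dictionary layout, and the sample rows are hypothetical, and rendering the clouds themselves is left to a plotting library.

```python
from collections import Counter

THRESHOLD = 60  # split point stated in the text


def artist_counts_by_popularity(songs, threshold=THRESHOLD):
    """Count artist appearances at or below, and above, the threshold."""
    less_popular = Counter()
    more_popular = Counter()
    for song in songs:
        # Popularity <= threshold goes to the "less popular" cloud.
        target = more_popular if song["popularity"] > threshold else less_popular
        target[song["artist"]] += 1
    return less_popular, more_popular


# Hypothetical rows standing in for the real dataset.
songs = [
    {"artist": "Artist A", "popularity": 55},
    {"artist": "Artist A", "popularity": 60},
    {"artist": "Artist B", "popularity": 61},
    {"artist": "Artist A", "popularity": 80},
]

less_popular, more_popular = artist_counts_by_popularity(songs)
```

Each count would then set the size of that artist's name in the corresponding cloud. Note that the same artist can appear in both clouds if their songs fall on both sides of the threshold.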

Conclusions

Overall, there is not much of a difference between the five genres in terms of the quantitative variables in this dataset, aside from when they tended to be released. While we did observe some minor trends in Energy and Danceability where one or two genres somewhat differentiated themselves from the rest, we could not observe any clear difference between them. We were also able to conclude that Popularity is not meaningfully associated with any of the other quantitative variables in the dataset, as all of the correlations are below 0.2 in magnitude. Indie music is less popular than other genres, but there seem to be no significant differences among the rest. Finally, we can conclude that a variety of qualitative and quantitative attributes appear to have changed over time, such as Top Genre and Danceability. Some of these changes are statistically significant, but others are not.

One limitation of our analysis concerns the PCA plot: its corresponding scree plot suggested that we should plot the first three dimensions, as the elbow occurred at k = 3. However, we did not include the third dimension because we were not able to figure out how to both add and interpret it. So, our PCA plot is most likely missing some information. In the future, we can research this further so that we can better represent these principal components. Another limitation comes from the data itself: there were many Dutch songs in the dataset even though, based on our manual research of their statistics, they would not normally appear in a canonical Top 2000 songs listing. So, perhaps these are not truly the Top 2000 songs of all time, but rather the Top 2000 songs from the perspective of a Dutch-biased country or person. In the future, we could eliminate those songs entirely, or perhaps address them more directly. Finally, as mentioned in our introduction, since we manually sorted the Top Genres into main genres, some bias may have been introduced in where we placed Top Genres that fit into multiple main genre categories. Perhaps getting a music expert’s opinion could ensure the accuracy of these categorizations.

Additionally, in our future work, we look forward to exploring these relationships with greater granularity and would be interested in experimenting with various subsets of the data to perform subgroup analyses. We did not do so in this report because we aimed to answer overarching questions. Potential questions we can attempt to answer in the future include: “Is Dutch music similar to non-Dutch music?” and “Are there differences between different types of rock music?”